PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
base: master
Conversation
A simple tech doc: mapping the ggml compute graph to the QNN compute graph, written with breakthrough help from chiwwang@QTI in April 2024.
I have already found that there are different technical paths to utilizing the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
pros: this approach can benefit greatly from the excellent "backend scheduler" feature in the ggml backend subsystem and can be a "functional implementation" or a good starting point in the upstream llama.cpp community; accordingly, this approach can be verified easily with my self-made script build-run-android.sh. cons: there might be performance concerns in the ggml-qnn backend.
pros: this approach might be equivalent to the principle shown in the quoted code above, and I guess that is the secret of how to utilize the Hexagon NPU maximally in the QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl. cons: it cannot take advantage of the backend scheduler feature and involves too much workload; there are many undocumented (or not very clear) technical details in the QNN SDK, so I think the necessary technical support should be provided by Qualcomm's tech team, even if I accomplish the final mission via the first approach with help from the great llama.cpp community.
Corrections from domain technical experts are greatly welcomed and appreciated.
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory usage increases hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker, nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
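Below is a minimal C++ sketch of the build-finalize-reuse pattern described above: the QNN graph is assembled from the ggml graph once, the expensive finalize ("compile") step is paid once, and the finalized graph is cached and reused on subsequent inference steps instead of being freed and rebuilt. The `qnn_graph_*` wrappers and `make_graph_key` are hypothetical placeholders rather than real QNN SDK entry points, and the exact ggml graph-traversal helpers may differ by ggml version.

```cpp
// Sketch only: cache a finalized QNN graph per ggml graph topology so the costly
// finalize ("compile") step runs once, not on every inference step.
// qnn_graph_create/add_node/finalize/execute and make_graph_key are hypothetical
// wrappers for illustration, not real QNN SDK symbols.
#include <string>
#include <unordered_map>
#include "ggml.h"

void *      qnn_graph_create();                                          // hypothetical
void        qnn_graph_add_node(void * graph, const ggml_tensor * node);  // hypothetical
void        qnn_graph_finalize(void * graph);                            // hypothetical
void        qnn_graph_execute(void * graph);                             // hypothetical
std::string make_graph_key(ggml_cgraph * cgraph);                        // hypothetical

static std::unordered_map<std::string, void *> g_finalized_graphs;

static void qnn_run_cgraph(ggml_cgraph * cgraph) {
    const std::string key = make_graph_key(cgraph);      // key on the graph topology
    auto it = g_finalized_graphs.find(key);
    if (it == g_finalized_graphs.end()) {
        void * graph = qnn_graph_create();
        for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
            // map one ggml node to one node inside the same QNN graph
            qnn_graph_add_node(graph, ggml_graph_node(cgraph, i));
        }
        qnn_graph_finalize(graph);                        // the expensive step, done once
        it = g_finalized_graphs.emplace(key, graph).first;
    }
    qnn_graph_execute(it->second);                        // cheap on later inference steps
}
```

With this pattern the memory growth from repeated finalization is avoided, at the cost of keeping the finalized graphs resident for the lifetime of the backend.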
I'm a little curious whether you are a regular employee of Qualcomm's Shenzhen branch. As I have said many times before, you can submit your own standalone PR and I personally would like to see your success in this community, but please don't bring intentionally challenging comments into my PR again and again:
Thanks so much! BTW, I personally don't think you are a regular employee of Qualcomm, because your behavior breaks many unwritten rules, and Qualcomm's top-talent regular employees don't do that.
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
As a very old programmer, as I said before: I have no intention of getting involved in a meaningless competition between me and you or your team, and I'd like to see your success in this community. What you did makes it seem like you are a PUA master (first offering unacceptable help, then angering the other person, then using someone else's hand to punish them, and finally achieving your purpose). I don't understand why you spent effort studying my comments in this community. I already admitted my mistake last year, here.
Thanks for your comment, although I see you are the code reviewer of a similar PR from a CN programmer, chraac; first of all, I wish you success in this community. This is a good question, and your concern is correct:
[updated on 02/26/2025] My previous answer might be wrong, because the first technical approach can work very well (quantized mulmat wasn't implemented yet when I wrote that simple tech doc); there are 10x-13x performance improvements in my local dev envs with the QNN backend. BTW, you can refer to my personal understanding of ggml-qnn and other ggml backends in that simple tech doc: "pros: this approach might be equivalent to the principle shown in the quoted code above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in the QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl." You will understand what I said there if you spend some time studying Huawei's CANN or Intel's SYCL. BTW, I have a strong media background in FFmpeg & OpenMAX IL / MediaCodec and other hardware-acceleration software stacks, although I know nothing about real hard-core AI tech.
* [ ] Low
* [x] Medium
* [ ] High
PR Description
This PR is a continuation of my original PR #6869, thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature).
This implementation puts the main logic in one single source file (ggml-qnn.cpp), because that helps other experienced programmers get involved in dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, or what Qualcomm did in ggml-opencl.cpp. Another main reason for this coding style is that I think it makes the developers' workflow easier:
This implementation is concise and focuses on the final mission of "how to utilize the Hexagon NPU maximally". It is not "cool" (it lacks some modern C++ features such as lambda expressions and complex/complicated C++ encapsulation), but I hope every domain programmer can understand the code and the domain technical details easily and quickly, so it can be considered an open-source reference implementation of ggml-qnn and can/might also be used in a production project. I think this PR will be helpful to the llama.cpp community even if it is not accepted.
After spending a great deal of effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend requires teamwork between experienced software programmers and AI experts, and even professional technical help from Qualcomm. In other words, I personally think no single independent programmer or development team can provide a full implementation of the ggml-qnn backend, because this community probably hasn't seen a programmer who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, one more important thing, familiar with the source code of ggml/llama.cpp even though he/she is not an AI expert.
Big picture of ggml-qnn backend
Please refer to the simple tech doc below: mapping the ggml compute graph to the QNN compute graph.
The first technical approach can be seen in this PR; accordingly, the second technical approach can easily be built on top of this PR with a similar coding style or a more complex/complicated C++ encapsulation.
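As a concrete illustration of the first approach, the sketch below shows the kind of per-op capability check that lets the ggml backend scheduler offload only matrix multiplications to the QNN backend and keep every other op on the CPU backend. The function name and the exact place where it hooks into the backend interface are illustrative assumptions, not the code in this PR.

```cpp
// Illustrative only: the per-op capability check behind the first approach.
// The backend claims just the ops it can run (here, 2-D mulmat); the ggml backend
// scheduler then automatically keeps all remaining ops on the CPU backend.
#include "ggml.h"

static bool ggmlqnn_can_handle_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            // start with plain 2-D matrix multiplication; other shapes stay on the CPU
            return ggml_n_dims(op->src[0]) == 2 && ggml_n_dims(op->src[1]) == 2;
        default:
            // unsupported ops are rejected, so the scheduler never routes them here
            return false;
    }
}
```

The more ops such a check accepts (and the backend actually implements), the fewer graph splits the scheduler has to create between the NPU and CPU backends.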
What I did in my first PR and this PR
All of the above items can be found in project KanTV unless otherwise noted; project KanTV is a device-AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the remaining parts of ggml-qnn are a team effort between AI experts and experienced programmers. We can work together to achieve the final goal: providing a production-level open-source ggml-qnn backend to the llama.cpp community.
Performance of ggml-qnn backend
Performance is the key point I always focus on for Android and other embedded systems. Here is the result in my local dev envs of how I addressed this performance issue in the early phase of the ggml-qnn backend (the backend still lacks many ops, although it is a genuinely functional backend):
Before fine-tuning:
Fig-1: llama-bench with the QNN NPU backend (Hexagon NPU) on Xiaomi 14

After fine-tuning:

Fig-2: llama-bench with the QNN NPU backend (Hexagon NPU) on Xiaomi 14; some mulmat operations have been offloaded to the NPU
Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on Xiaomi 14. (AI experts can explain why there is such a big difference in the second measurement between Fig-2 and Fig-3; this would be helpful for performance fine-tuning in this PR.)

How to set up dev envs on a Linux machine
Ubuntu 20.04 / 22.04 is validated and recommended; other Linux distributions might also work.
The dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev envs is simple:
How to build the ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon-based phone
We can confirm that this backend works as expected from the log output of "adb logcat | grep ggml-qnn". For programmers, "adb logcat | grep ggml-qnn" is also useful for troubleshooting.
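For context on why that grep filter works, here is a small sketch assuming the backend routes its log output through the Android NDK logger with a fixed "ggml-qnn" tag; the actual macro names used in this PR may differ.

```cpp
// Sketch, not the exact macros from this PR: send backend logs through the
// Android NDK logger with a fixed tag so "adb logcat | grep ggml-qnn" shows them.
#include <android/log.h>

#define GGMLQNN_LOG_TAG  "ggml-qnn"
#define GGMLQNN_LOG_INFO(...) \
    __android_log_print(ANDROID_LOG_INFO, GGMLQNN_LOG_TAG, __VA_ARGS__)

static void ggmlqnn_log_backend_ready(const char * device_name) {
    // appears in logcat as: I/ggml-qnn: backend ready on <device_name>
    GGMLQNN_LOG_INFO("backend ready on %s", device_name);
}
```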
How to build the ggml-qnn source code for a Snapdragon-based WoA (Windows on ARM) device
I have no knowledge of Windows programming, although the source code of ggml/llama.cpp is portable and Qualcomm's well-designed QNN SDK is also portable.
I know @chraac's team has already done this work in this community. I think it can be migrated to this PR with help from his team, or by me manually after I carefully check his team's Windows port implementation.
Acknowledgement
AI-assisted programming for ggml-qnn backend
Recently I tried AI-assisted programming for the ggml-qnn backend with help from the powerful Grok 3 (I tried DeepSeek-R1 but failed; I personally think Grok 3 is closer to a natural human domain expert, and it really helped me a lot in this PR. My English level can't express my real feeling about Grok 3; you can try it yourself).
@ggerganov @slaren, thanks to your outstanding "backend scheduler" feature, this ggml-qnn backend is now a functional QNN backend, and LLM inference performance on Snapdragon 8 Gen 3 is greatly improved (10x-13x). Please have a look when you have time. It would help other domain experts and AI experts get involved in dev activities if this PR could be accepted: there is a lot of QNN SDK API assembly and calling in the remaining parts of ggml-qnn, regardless of C or C++ style, and those remaining parts are not easy; much workload and effort are required for a real product, especially teamwork between AI experts and experienced programmers. I know Qualcomm's experts already participate in llama.cpp's dev activities; maybe they can help with code review. Thanks so much.